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Abstract 

The  Fast  Fourier  Transform  (FFT)  plays  a  key  role  in  many  areas  of  computational  science 
and  engineering.  Although  most  one-dimensional  FFT  problems  can  be  solved  entirely  in  main 
memory,  some  important  classes  of  applications  require  out-of-core  techniques.  For  these,  use  of 
parallel  I/O  systems  can  improve  performance  considerably.  This  paper  shows  how  to  perform 
one-dimensional  FFTs  using  a  parallel  disk  system  with  independent  disk  accesses.  We  present 
both  analytical  and  experimental  results  for  performing  out-of-core  FFTs  in  two  ways:  using 
traditional  virtual  memory  with  demand  paging,  and  using  a  provably  asymptotically  optimal 
algorithm  for  the  Parallel  Disk  Model  (PDM)  of  Vitter  and  Shriver.  When  run  on  a  DEC  2100 
server  with  a  large  memory  and  eight  parallel  disks,  the  optimal  algorithm  for  the  PDM  runs 
up  to  144.7  times  faster  than  in-core  methods  under  demand  paging.  Moreover,  even  including 
I/O  costs,  the  normalized  times  for  the  optimal  PDM  algorithm  are  competitive,  or  better  than, 
those  for  in-core  methods  even  when  they  run  entirely  in  memory. 
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1  Introduction 


Fourier  analysis  plays  a  pivotal  role  in  many  branches  of  science  and  engineering.  The  Fourier 
transform’s  input  is  an  iV'-vector  of  complex  numbers,  representing  some  discretized  function.  The 
Fourier  representation  of  this  function  is  a  sum  of  N  weighted  sine  and  cosine  functions  with 
specific  frequencies.  Computing  the  coefficients  of  the  constituent  functions  yields  a  great  deal  of 
information  about  the  function.  Well-known  Fast  Fourier  Transform  (FFT)  techniques  accomplish 
the  computation  in  ©(iVlgiV)  operations. 

Since  the  modern  discovery  of  the  FFT  by  Cooley  and  Tukey  in  1965  [CT65],  a  profusion  of  FFT 
methods  have  been  developed,  primarily  to  optimize  it  for  different  types  of  computer  architectures 
such  as  vector  and  parallel  machines  (e.g.,  see  Van  Loan  [Van92]).  The  work  we  present  here 
continues  in  that  vein,  looking  at  ways  of  organizing  an  FFT  computation  to  take  advantage  of 
parallel  I/O  systems.  Of  course,  such  an  endeavor  is  useful  only  if  the  input  vector  is  too  large  to 
fit  in  the  main  memory  of  a  computer;  in  most  uses  of  the  FFT,  the  input  vector  will  fit  in  core. 

Some  critical  applications  require  extremely  large  one-dimensional  FFTs,  particularly  when  the 
subject  function  exhibits  critical  phenomena  at  vastly  different  time  scales  and  high  resolution  is 
required.  One  such  application  is  seismic  analysis  [Cla85],  where  an  out-of-core  one-dimensional 
FFT  is  necessary  (as  part  of  a  higher  dimensional  FFT)  even  when  the  computer  memory  has 
16  gigabytes  of  available  RAM  [Rut96].  Another  application  is  in  the  area  of  radio  astronomy. 
The  High-Speed  Data  Acquisition  and  Very  Large  FFTs  Project  at  Caltech^  uses  FFTs  to  sup¬ 
port  searching  for  fast  (millisecond  period)  pulsars.  The  project  currently  requires  FFTs  with  10 
gigapoints,  and  it  desires  FFTs  with  up  to  64  gigapoints.  Yet  another  application  is  for  integer 
multiplication  of  very  large  numbers  [CF94],  which  is  a  key  component  in  the  most  modern  methods 
of  searching  for  Mersenne  prime  numbers.  FFTs  are  used  in  many  ways  to  manipulate  data  sets, 
such  as  convolution/deconvolution,  correlation/au to- correlation,  filtering,  and  power  spectrum  es¬ 
timation  [PFTV88].  Any  time  the  data  set  is  very  large  and  accuracy  is  essential,  very  large  FFTs 
are  required. 

The  contribution  of  the  present  paper  is  to  present  an  out-of-core  FFT  algorithm  that  exploits 
parallel  I/O  and  to  assess  its  performance.  The  algorithm  is  a  variant  of  one  that  was  sketched  by 
Vitter  and  Shriver  [VS94],  and  which  achieves  the  lower  bound  on  complexity  proven  by  Aggarwal 
and  Vitter  [AV88].  In  particular,  we  show  how  efficient  out-of-core  permutation  routines  can  be  used 
throughout  the  FFT  computation.  We  assess  performance  by  comparison  with  demand  paging;  we 
show'  analytically  and  experimentally  that  weU-known  in-core  FFT  algorithms  run  slowly  once  the 
data  set  size  exceeds  available  in-core  memory.  Using  only  a  single-disk  system,  we  observe  that 
our  out-of-core  method  runs  over  46  times  faster  than  demand  paging;  with  eight  disks  we  observe 
up  to  two  orders  of  magnitude  improvement  using  our  technique. 

The  remainder  of  this  paper  is  organized  as  follows.  Section  2  summarizes  some  FFT  methods 
for  in-core  computation,  and  Section  3  discusses  published  out-of-core  FFT  methods  for  single-disk 
systems.  Section  4  demonstrates  why  conventional  demand-paged  in-core  FFT  algorithms  perform 
badly  when  the  problem  size  exceeds  the  physical  memory.  In  Section  5,  we  define  the  Parallel 
Disk  Model  (PDM).  Section  6  describes  our  out-of-core  algorithm.  Section  7  presents  and  analyzes 
running  times  for  our  FFT  implementation  on  two  different  DEC  Alpha-based  uniprocessor  systems. 
Finally,  we  summarize  in  Section  8. 


^  See  http :  / /wbw  .  cacr .  caltech .  edu/ SIO/ APPL/phy02 .  html. 
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2  In-core  FFTs 


This  section  reviews  Fourier  transforms  and  outlines  some  well-known  FFT  methods  for  in-core 
computation.  For  further  background  on  the  FFT,  see  any  of  the  texts  [CLR90,  Nus82,  Van92]. 

Discrete  Fourier  transforms 

Fourier  transforms  are  based  on  complex  roots  of  unity.  The  principal  Nth  root  of  unity  is  a 
complex  number  where  i  =  For  any  real  number  u,  e*“  =  cos(u)  -f  i  sin(u). 

Given  a  vector  a  =  (ao?oi5  •  -  -lO-N-i)-!  where  iV  is  a  power  of  2,  the  Discrete  Fourier  Transform 
(DFT)  is  a  vector  y  =  iyo,yi,  ■  ■  ■,  Vn-i)  for  which 


N-i 

Vk  =  Yi 

j=o 


for  A:  =  0, 1, . .  .,iV  —  1  . 


(1) 


We  also  write  y  =  DFT;v(o)- 

Fast  Fourier  Transforms 

Viewed  merely  as  a  linear  system,  0(iV^)  time  is  needed  to  compute  vector  y.  The  well-known  Fast 
Fourier  Transform  technique  requires  only  0(iVlgiV)  time,  as  follows.  Splitting  the  summation  in 
equation  (1)  into  its  odd-  and  even-indexed  terms,  we  have 

N/2-1  N/2-1 

yk  —  ^Nl2^‘^i  • 

j=0  j=0 

Each  of  these  sums  is  itself  a  DFT  of  a  vector  of  length  iV/2.  When  Q  <  k  <  Nf2,  it  is  easy  to 
see  how  to  combine  the  results  of  these  smaller  DFTs.  When  N/2  <  k  <  N,  it  is  easy  to  show 
that  HsJioe,  we  can  compute  y  =  DFTjv(a)  by  the  following 

recursive  method: 

1.  Split  a  into  a®''®"  =  (uo,  ^2?  •  •  •  ? fljV-2)  and  =  (oi,  03, . . . ,  aAr-i). 

2.  Recursively  compute  =  DFT;v/2(®*'"®")  and  y°^^  =  DFT;v/2(®°‘^'^)- 

3.  For  A:  =  0, 1, ... ,  N/2  -  1,  compute  yk  =  yl''^^  -f  yk+N/2  =  The 

factor  is  often  referred  to  as  a  twiddle  factor. 

By  fully  unrolling  the  recursion,  we  can  view  the  FFT  computation  as  Figure  1  shows.  First, 
the  input  vector  undergoes  a  bit-reversal  permutation,  and  then  a  butterfly  graph  of  Ig  N  stages  is 
computed.  A  bit-reversal  permutation  is  a  bijection  in  which  the  element  whose  index  k  in  binary  is 
feiv-i,  ^Ar-27  ••■7^0  maps  to  the  element  whose  index  in  binary  is  ko,  fci, . . . ,  Arjv-i.  In  the  sth  stage 
of  the  butterfly  graph,  elements  whose  indices  are  2®  apart  (after  the  bit-reversal  permutation) 
participate  in  a  butterfly  operation,  as  described  in  step  3  above.  The  butterfly  operations  in  the 
sth  stage  can  be  organized  into  N/2^  groups  of  2®  operations  each. 

FFT  algorithms 

When  the  FFT  is  computed  according  to  Figure  1  in  a  straightforward  manner — left  to  right  and 
top  to  bottom — the  result  is  the  classic  Cooley- Tukey  FFT  method  [CT65].  Several  other  methods 
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Figure  1:  The  FFT  computation  after  fully  unrolling  the  recursion,  shown  here  with  N  =  S.  Inputs  (ao.oi, 
enter  from  the  left  and  first  undergo  a  bit-reversal  permutation.  Then  Ig  TV  =  3  stages  of  butterfly 
operations  are  performed,  and  the  results  {yo,  yi,  •  •  •,  yN-i)  emerge  from  the  right.  This  figure  is  taken  from 
[CLR90,  p.  796]. 

have  been  developed  to  improve  performance  on  vector  machines  and  in  memory  hierarchies,  by 
avoiding  the  bit-reversal  permutation  to  improve  locality  of  reference. 

Stockham’s  method  [Van92,  pp.  49-58]  eliminates  bit-reversal  by  permuting  the  N  values  before 
each  of  the  IgiV  stages  of  the  butterfly  network.  Its  memory  requirement,  however,  is  twice  that 
of  the  Cooley- Tukey  method. 

Another  method,  attributed  by  Bailey  [Bai90]  to  P.  Swarztrauber  as  a  variation  of  an  algo¬ 
rithm  by  Gentleman  and  Sande,  and  also  attributed  to  E.  Granger  by  Brenner  [Bre69],  splits  the 
summation  of  equation  (1)  into  y/N  summations  each  with  -/W  terms.  (Here  we  take  iV  to  be  a 
power  of  4,  but  the  method  can  be  generalized).  We  split  into  y/N  DFTs  rather  than  two;  each 
DFT  is  comprised  of  all  terms  whose  indices  are  congruent  modulo  y/W.  The  analog  of  a  butterfly 
operation  adds  y/W  terms  (expressible  as  DFTs)  that  are  computed  by  recursive  calls  to  problems 
of  size  y/W.  This,  Swarztrauber’s  method,  is  given  by  the  following  steps,  which  operate  in  place: 

1.  Treating  the  vector  a  =  (oo,ai, . .  .,Ojv-i)  as  a  y/W  x  y/W  matrix  stored  in  row-major  order, 
transpose  it  so  that  elements  whose  original  indices  are  congruent  modulo  y/W  appear  in  the 
same  row. 

2.  Compute  the  DFT  of  each  y/W -element  row  individually. 

3.  Scale  the  resulting  matrix  by  multiplying  the  entry  in  row  j  and  column  khy 

4.  Transpose  the  matrix. 

5.  Compute  the  DFT  of  each  y/W -element  row  individually. 

6.  Transpose  the  matrix  and  interpret  it  once  again  as  an  iV-element  vector  to  produce  the  result 

y  =  {yo,yi,-  ■  -  yVN-i)- 

This  method  runs  in  time  ©(AlgiV).  Rehance  on  smaller  DFTs  improves  locality  in  memory 
hierarchies.  Experiments  reported  in  Section  4  show  this  method  to  be  nearly  twice  as  fast  as 
others  on  in-core  computations. 
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3  Out-of-core  FFTs 


Here  we  briefly  survey  published  out-of-core,  single-disk,  one- dimensional  FFT  algorithms. 

Note  that  an  out-of-core  method  based  on  Swarztrauber’s  method  is  easy  when  M  <  N  < 
because  each  ^/W -sized  DFT  fits  in  memory.  This  relation  between  M  and  N  is  entirely  reasonable 
given  contemporary  memory  sizes  and  prices.  The  method  does  require  an  out-of-core  matrix- 
transpose  subroutine  to  accomplish  steps  1,  4,  and  6.  Bailey  recommends  an  algorithm  by  Fraser 
[Fra76]  for  BPC  (bit-permute/complement)  permutations  on  one  disk,  whereas  Brenner  details  a 
transposition  algorithm. 

When  the  problem  size  just  barely  exceeds  the  memory  size,  Brenner  suggests  a  method  devel¬ 
oped  by  W.  Ryder.  This  method,  which  is  a  specialization  of  Swarztrauber’s  method,  eliminates 
the  first  two  matrix  transpositions.  The  cost  of  doing  so,  however,  is  that  the  computation  time 
contains  a  term  proportional  to  so  that  if  iV  >  M,  the  computation  time  is  very  high. 

Sweet  and  Wilson  [SW95]  use  an  extension  of  Swarztrauber’s  method  to  perform  FFTs  even 
when  iV  >  on  the  CM-5  using  a  Scalable  Disk  Array  (SDA)  [TMC92],  which  appears  to  the 
programmer  as  one  large  disk.  The  method  used  by  Sweet  and  Wilson  requires  an  out-of-core 
bit-reversal  permutation,  and  they  use  Fraser’s  algorithm. 

The  algorithm  we  present  in  Section  6  fleshes  out  the  details  of  a  sketch  given  by  Vitter  and 
Shriver  [VS94].  Because  they  focus  on  pebbling  the  butterfly  graph,  some  essential  steps  are  omitted 
from  their  description  (e.g.,  the  implementation  of  an  efficient  out-of-core  bit-reverse  permutation); 
nevertheless  their  paper  is  properly  viewed  as  the  basis  for  our  work. 

4  Performance  of  FFTs  with  demand  paging 

In  this  section,  we  show  that  the  in-core  FFT  methods  described  earlier  perform  poorly  under 
demand  paging  once  the  problem  size  exceeds  the  available  memory.  In  particular,  we  show  that 
the  number  of  page  faults  for  the  Cooley- Tukey  bit-reversal  computation  is  proportional  to  N 
and  that  even  under  the  best  of  conditions  the  butterfly  steps  for  all  methods  suffer  from  a  poor 
computation-to-I/0  ratio.  We  substantiate  our  conclusions  with  experimental  results. 

Analysis  of  bit  reversal 

The  following  pseudocode  expresses  an  in-place  bit-reversal  permutation  of  A-element  array  A: 

for  j  ^  0  to  A  —  1 

do  let  f  be  the  Ig  A’-bit  reversal  of  j 

if  j  <  f 

then  exchange  A\j]  ^  A[j'] 

Theorem  1  Suppose  that  the  in-place  bit-reversal  permutation  code  above  is  performed  under  de¬ 
mand  paging  with  least-recently-used  page  replacement.  Suppose  further  that  there  are  A  =  2” 
elements  in  the  array,  the  physical  memory  can  hold  M  =  2’"  elements,  and  each  page  holds 
B  =  2^  elements,  where  n,  m,  and  b  are  positive  integers,  A  >  2M,  and  A  >  2B.  Finally,  assume 
that  the  array  A  starts  at  a  page  boundary  and  that  no  pages  of  A  are  initially  in  memory.  Then 
the  bit-reversal  permutation  induces  at  least  N/A  page  faults. 

Proof:  We  will  show  that  each  element  of  the  set  F  =  {j  :  0  <  j  <  N/2  and  j  is  odd}  induces  a 
page  fault.  Noting  that  |F|  =  A/4  will  then  prove  the  theorem. 
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Observe  that  for  each  element  j  G  F,  we  have  j  <  f,  since  j  has  a  most  significant  bit  of  0  and 
a  least  significant  bit  of  1.  Thus,  the  exchange  of  A[j]  and  A[j']  will  occur  for  each  j  €  F.  Let  F' 
be  the  set  of  destination  pages  referenced  when  processing  members  of  F. 

We  compute  which  page  an  element  is  on  as  follows.  For  a  given  n-bit  index  into  A,  the  least 
significant  b  bits  give  the  position  on  the  page,  and  the  most  significant  n  -  b  bits  give  the  page 
number.  Thus,  the  elements  of  A  that  are  destined  for  the  same  page  p  have  the  same  value  in 
their  least  significant  source  indices. 

To  determine  whether  a  given  reference  to  A[j']  causes  a  page  fault,  we  compute  the  “stack 
distance”  for  the  page  containing  A[j'].  The  stack  distance  [MGST70]  of  a  reference  to  page  p 
is  one  plus  the  number  of  uniquely  different  pages  referenced  since  the  most  recent  reference  to 
page  p.  A  reference  to  a  page  causes  a  page  fault  if  and  only  if  the  stack  distance  of  that  reference 
exceeds  the  number  of  pages  that  memory  can  hold,  which  is  exactly  MjB.  By  our  assumption 
that  no  pages  of  A  are  initially  in  memory,  we  consider  the  stack  distance  to  the  first  reference  to 
a  page  of  A  to  be  infinite. 

Next  we  show  that  for  each  page  p  €  F' ,  as  we  progress  through  the  values  j  =  0,1,...,  N/2- 1, 
the  stack  distance  between  successive  references  to  page  p  is  greater  than  iV/25.  Once  a  reference 
is  made  to  destination  page  p,  another  N/B  -  1  values  of  j  will  be  considered  before  the  next 
reference  to  page  p.  Of  these,  iV/25  —  1  are  in  F  and  thus  cause  a  reference  to  a  unique  destination 
page  in  F' .  The  page  containing  index  j  is  also  referenced,  and  this  page  is  not  in  F',  and  so  at 
least  N/2B  distinct  pages  are  referenced.  As  long  as  no  value  j  e  F  resides  on  the  same  page  as  its 
destination  index  f,  the  stack  distance  between  successive  references  to  page  p  is  strictly  greater 
than  N/2B.  But  because  N  >  2B,  there  are  at  least  two  pages  in  the  array  A,  and  because  A  also 
starts  at  a  page  boundary,  no  element  in  the  first  N/2  positions  resides  on  the  same  page  as  an 
element  in  the  last  N/2  positions.  Since  each  element  j  e  F  is  in  the  first  N/2  positions  and  maps 
to  an  element  in  the  last  N/2  positions,  we  conclude  that  the  stack  distance  is  indeed  greater  than 
N/2B. 

Because  N  >  2M,  we  have  that  N/2B  >  M/B,  and  so  each  reference  to  a  page  of  F'  causes  a 
page  fault.  Since  references  to  pages  in  F'  are  induced  by  source  elements  in  F,  we  see  that  each 
time  a  member  of  F  is  processed,  a  page  fault  ensues,  which  completes  the  proof.  ■ 

The  proof  of  Theorem  1  substantially  undercounts  page  faults.  A  more  extensive  analysis  using 
similar  ideas  shows  that  the  number  of  page  faults  is  at  least  (N/2  —  2y/N){l  —  2/(N/M)^). 

Analysis  of  butterfly  stages 

All  of  the  FFT  methods  that  we  have  discussed  exhibit  relatively  good  locality  when  executing 
each  butterfly  stage.  For  both  Cooley- Tukey  and  Stockham,  each  butterfly  stage  essentially  sweeps 
through  all  the  data  pages,  exactly  once,  with  no  more  than  2  data  pages  actively  in  use  at  a  time. 
Swarztrauber’s  method  exhibits  more  complex  behavior  because  of  the  matrix  transposes,  but  its 
constituent  butterflies  act  like  the  other  two  methods.  The  essential  point  to  be  noted  is  that  during 
a  butterfly  stage,  each  data  point  is  updated  once  by  a  complex  addition/ subtraction  (two  floating¬ 
point  operations),  and  half  the  data  points  also  involve  a  complex  multiplication  (six  floating-point 
operations).  A  typical  8  KB  data  page  contains  512  points,  and  so  it  entails  2560  floating-point 
operations.  The  time  required  to  fault  in  a  data  page  is  on  the  order  of  10“^  seconds  (most  of  which 
is  independent  of  the  page  size),  but  the  time  to  process  that  page  is  about  an  order  of  magnitude 
less.  Even  with  much  better  locality  than  the  bit-reversal  computation,  demand-paged  FFT  suffers 
greatly  from  waiting  for  I/O  to  complete.  We  can  mitigate  this  bottleneck  by  either  increasing  the 
size  of  block  fetched  per  I/O,  and/or  by  prefetching  memory  blocks.  Our  out-of-core  technique 
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Method 

Cooley-Tukey  ^ 

Stockham 

Swarztrauber 

Problem  size 

Seconds 

Normalized 

Seconds 

Normalized 

Seconds 

Normalized 

N  =  2^® 

1.54558 

1.47398 

2.15616 

2.05627 

1.1.3762 

1.08492 

N  =  2^^ 

N  =  21* 

3.36112 

7.25278 

1.50843 

1.53707 

4.64709 

9.85693 

2.08556 

2.08896 

5.01269 

1.06233 

N  =  2i® 

N  =  2^° 

15.5745 

.35.42.36 

1.56347 

1.68913 

20.9941 

44.6760 

2.10753 

2.13032 

24.6-568 

1.17573 

N  =  2^1 

N  =  222 

75.1581 

11591.7 

1.706.58 

125.621 

972.0.35 

2022.26 

22.0715 

21.9157 

443.147 

4.80248 

N  =  22* 

N  =  224 

42553.5 

220.555 

4097.72 

8746.85 

21.2385 

21.7230 

2226.75 

5.53019 

Table  1;  Running  times  for  the  three  in-core  FFT  methods  on  the  workstation  zayante,  with  64  MB  of 
memory.  For  each  method  and  problem  size,  we  show  the  time  in  seconds  and  also  the  normalized  time 
(italics,  in  microseconds)  which  is  the  running  time  divided  by  NlgN. 


does  both. 

Experimental  results 

Here  we  present  running  times  of  the  three  demand-paged  in-core  FFT  methods  ( Cooley- Tukey, 
Stockham,  and  Swarztrauber).  They  were  coded  in  C,  compiled  using  gcc  with  02  optimization, 
and  run  on  a  DEC  3000  Alpha-based  workstation  running  Digital  UNIX  V3.2C.  The  workstation, 
named  zayante,  has  a  clock  cycle  of  175  MHz,  64  MB  of  memory,  and  a  512  MB  virtual-address 
space. 

Table  1  gives  running  times.  The  Cooley-Tukey  and  Swarztrauber  methods  both  use  16iV 
bytes;  Stockham  uses  32A'  and  so  experiences  heavy  paging  one  problem  size  earlier  than  the 
others.  Because  our  implementation  of  Swarztrauber’s  method  requires  N  to  be  a  power  of  4, 
timings  for  odd  powers  of  2  are  omitted. 

From  Table  1,  we  see  the  effects  of  demand  paging.  By  avoiding  bit-reversal,  the  Stockham 
and  Swarztrauber  methods  do  not  experience  the  degree  of  thrashing  suffered  by  Cooley-Tukey. 
(In  fact,  we  did  not  even  run  Cooley-Tukey  for  N  =  2'^^,  anticipating  a  run  time  of  about  a  day.) 
Swarztrauber’s  method  is  notably  faster  in  each  case,  probably  due  to  its  substantially  better 
locality  in  cache.  Nevertheless,  we  shall  see  in  Section  (  that  our  explicit  out-of-core  algorithm 
runs  faster  than  Swarztrauber’s  method  on  the  same  system  for  a  problem  size  of  iV  =  2^^. 

5  The  Parallel  Disk  Model 

This  section  describes  the  Parallel  Disk  Model  [VS94].  We  shall  use  this  model  in  Section  6  to 
design  an  out-of-core  FFT  algorithm. 

In  the  Parallel  Disk  Model,  or  PDM,  N  records  are  stored  on  D  disks  Vo,Vi,.  with 

Nj D  records  stored  on  each  disk.  For  our  purposes,  a  record  is  a  complex  number  comprised  of  two 
8-byte  double-precision  floats.  The  records  on  each  disk  are  partitioned  into  blocks  of  B  records 
each.2  Any  disk  access  transfers  an  entire  block  of  records.  Disk  I/O  transfers  records  between 

block  might  consist  of  several  sectors  of  a  physical  device  or,  in  the  case  of  RAID  [CGK+88,  Gib92,  PGK88], 
sectors  from  several  physical  devices. 
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Figure  2:  The  layout  of  iV  =  64  records  in  a  parallel  disk  system  with  5  =  2  and  D  =  8.  Each  box 
represents  one  block.  The  number  of  stripes  is  NjBD  =  4.  Numbers  indicate  record  indices. 


the  disks  and  an  M-record  random-access  memory.  Any  set  of  M  records  is  a  memoryload.  Each 
parallel  I/O  operation  transfers  up  to  D  blocks  between  the  disks  and  memory,  with  at  most  one 
block  transferred  per  disk,  for  a  total  of  up  to  BD  records  transferred.  The  most  general  type  of 
parallel  I/O  operation  is  independent  I/O,  in  which  the  blocks  accessed  in  a  single  parallel  I/O  may 
be  at  any  locations  on  their  respective  disks.  A  more  restricted  operation  is  striped  I/O,  in  which 
the  blocks  accessed  in  a  given  operation  must  be  at  the  same  location  on  each  disk. 

We  assess  an  algorithm  by  the  number  of  paraUel  I/O  operations  it  requires.  While  this  does 
not  account  for  unavoidable  variation  in  disk- access  times,  the  number  of  disk  accesses  can  be 
minimized  by  carefully  designed  algorithms. 

We  place  some  restrictions  on  the  PDM  parameters.  We  assume  that  B,  D,  M,  and  N  are 
exact  powers  of  2.  For  convenience,  we  define  6  =  Ig  5,  m  =  Ig  M,  and  n  =  Ig  AT.  We  assume  that 
BD  <  M  in  order  to  fully  utilize  disk  bandwidth,  and  of  course  we  assume  that  M  <  N. 

The  PDM  lays  out  data  on  a  parallel  disk  system  as  shown  in  Figure  2.  A  stripe  consists  of  the 
D  blocks  at  the  same  location  on  all  D  disks.  A  record’s  index  is  an  n-bit  vector  x  with  the  least 
significant  bit  first:  x  =  (a:o,a;i, . .  .,a;„_i).  Record  indices  vary  most  rapidly  within  a  block,  then 
among  disks,  and  finally  among  stripes.  The  most  significant  n  —  m  bits  of  an  index  indicate  its 
memoryload  number. 

Since  each  paraUel  I/O  operation  accesses  at  most  BD  records,  any  algorithm  that  must  access 
aU  N  records  requires  Il{NIBD)  paraUel  I/Os,  and  so  0{NfBD)  parallel  I/Os  is  the  analogue  of 
linear  time  in  sequential  computing.  The  FFT  algorithm  we  implemented  has  an  1/ 0  complexity  of 
0  which  appears  to  be  the  analogue  of  the  e{NlgN)  bound  seen  for  so  many 

sequential  algorithms  on  the  standard  RAM  model. 

6  An  explicit  out-of-core  FFT  algorithm  for  the  PDM 

By  taking  fuU  advantage  of  a  paraUel  disk  system,  we  can  get  considerably  better  out-of-core  FFT 
performance  than  we  get  by  using  just  demand  paging.  This  section  presents  an  expUcit  out-of-core 
FFT  algorithm  designed  for  the  PDM.  The  key  idea  is  to  redraw  the  butterfly  graph  by  inserting 
permutations.  We  then  recognize  that  bit-reversal  and  the  added  permutations  belong  to  the  larger 
class  of  BMMC  permutations.  We  use  a  prior  out-of-core  BMMC  algorithm  to  produce  an  efficient 
out-of-core  FFT. 

BMMC  permutations  on  the  PDM 

A  BMMC  (bit-matrix-multiply/complement)  permutation  on  A  =  2”  elements  is  specified  by  an 
n  X  n  characteristic  matrix  H  =  {hij)  whose  entries  are  drawn  from  {0, 1}  and  is  nonsingular  (i.e.. 


invertible)  over  GF{2).^  The  specification  also  includes  a  complement  vector  c  —  (cq,  ci, . . . ,  c„_i)  of 
length  n.  Treating  each  source  index  x  as  an  n-bit  vector,  we  perform  matrix- vector  multiplication 
over  GF{2)  and  then  form  the  corresponding  n-bit  target  index  2  by  complementing  some  subset  of 
the  resulting  bits:  z  =  Hx®c.  As  long  as  the  characteristic  matrix  H  is  nonsingular,  the  mapping 
of  source  indices  to  target  indices  is  one-to-one. 

A  very  efficient  algorithm  for  BMMC  permutations  on  the  PDM  appears  in  [CSW94].  This 
algorithm  requires  at  most  ^  P^^raUel  I/Os,  where  7  is  the  lower  left  lg{N/B)  x 

Ig  B  submatrix  of  the  characteristic  matrix,  and  the  rank  is  computed  over  GF{2).  (Note  that 
because  of  the  dimensions  of  7,  its  rank  is  at  most  lgmin(iV/5,5).)  This  number  of  factors  is 
asymptotically  optimal  and  is  very  close  to  the  best  known  exact  lower  bound. 

We  shall  use  two  types  of  BMMC  permutations  to  perform  the  FFT.  Both  use  a  complement 
vector  that  is  all  Os. 

Bit-reversal  permutation:  The  characteristic  matrix  has  Is  on  the  antidiagonal  and  Os  else¬ 
where.  The  submatrix  7  has  as  much  rank  as  possible,  so  that  rank  7  =  lgmm{B,N/B). 

fc-bit  right-rotation:  We  rotate  the  bits  of  each  index  k  bits  to  the  right,  wrapping  around  at 
the  rightmost  position.  The  characteristic  matrix  is  formed  by  taking  the  identity  matrix  and 
rotating  its  columns  k  positions  to  the  right,  and  rank  7  <  min(A;,lg.B,lg(A/5)). 

Redrawing  the  butterfly 

Figure  3  shows  the  structure  of  our  algorithm.  This  redrawing  of  the  butterfly  was  devised  by 
Snir  [SniSl]  and  is  implicitly  used  in  the  FFT  algorithm  of  Vitter  and  Shriver  [VS94].  Assume  for 
the  moment  that  Ig  M  divides  Ig  N.  As  in  the  Cooley- Tukey  method,  we  start  with  a  bit-reversal 
permutation.  Then  there  are  Ig  N/  Ig  M  superlevels,  where  each  superlevel  consists  of  N/M  separate 
“mini-butterflies”  followed  by  a  (IgM)-bit  right-rotation  permutation  on  the  entire  array. 

Each  mini-butterfly  is  a  butterfly  graph  on  M  values,  and  hence  it  has  depth  IgM  and  a 
sequential  running  time  of  0(MlgM).  The  size  M  of  a  mini-butterfly  is  chosen  so  that  each 
mini-butterfly  is  computed  by  reading  in  a  memoryload,  computing  the  mini-butterfly  graph,  and 
writing  out  the  memoryload. 

Analysis 

This  FFT  algorithm  consists  of  one  bit-reversal  permutation  followed  by  IgiV/lgM  superlevels. 
As  noted  above,  the  bit-reversal  permutation  requires  at  most  ^  "b 

I/Os.  Each  superlevel  requires  2N/BD  parallel  I/Os  to  read  and  write  all  N/M  mini- butterflies 
plus  at  most  ^  ^  2^  parallel  I/Os  to  perform  the  (IgM)-bit  rotation  permu¬ 

tation.  Since  the  PDM  requires  that  BD  <  M  and  D  >  1,  we  have  B  <  M,  and  so  the  IgM 
factor  in  the  numerator  drops  out.  Asymptotically,-  the  number  of  parallel  I/O  operations  is 
®  IgW  ’  ''^hich  can  be  shown  via  simple  manipulations  to  equal  the  lower  bound 

of  proven  by  Aggarwal  and  Vitter  [AV88]. 


^Matrix  multiplication  over  GF{2)  is  like  standard  matrix  multiplication  over  the  reals  but  with  all  arithmetic 
performed  modulo  2.  Equivalently,  multiplication  is  replaced  by  logical- and,  and  addition  is  replaced  by  exclusive-or. 


superlevel  superlevel  superlevel 

-  Ig  AVIg  M  superlevels  - 


Figure  3;  The  structure  of  the  out-of-core  FFT  algorithm  for  the  PDM.  After  a  bit-reversal  permutation, 
we  perform  IgTV/lgiW  superlevels.  Each  superlevel  consists  of  N/M  mini-butterflies  on  M  values,  followed 
by  a  (Ig  M)-bit  right-rotation  permutation  on  the  entire  array. 


Handling  general  values  of  N  and  M 

If  IgM  does  not  tivite  IgiV,  then  we  compensate  in  the  last  superlevel.  Rather  than  computing 
mini-butterflies  of  depth  IgM  in  the  last  superlevel,  we  compute  mini-butterflies  of  depth  r  = 
(Ig  N)  mod  (Ig  M),  which  is  the  number  of  levels  of  the  full  butterfly  graph  not  yet  computed.  We 
can  still  read  and  write  memoryloads  of  M  values,  but  now  each  memoryload  in  the  last  superlevel 
consists  of  MIT'  mini-butterflies. 

Out-of-core  Swarztrauber’s  method 

If  M  <  iV  <  M^,  we  could  use  an  explicit  out-of-core  version  of  Swarztrauber’s  method.  The 
matrix-transpose  steps  are  BMMC  permutations,  since  exchanging  row  and  column  numbers  within 
an  index  is  a  ((Ig  iV)/2)-bit  rotation  permutation.  Thus,  there  would  be  three  BMMC  permutations, 
which  is  just  as  many  as  the  our  algorithm  performs  when  M  <  N  <  M^.  (Our  algorithm  has  the 
further  advantage  of  working  even  when  N  >  M^.) 

Moreover,  the  BMMC  permutations  that  an  out-of-core  Swarztrauber  implementation  would 
perform  are  no  faster  than  those  of  our  algorithm.  When  done  out-of-core,  transposing  a  square 
matrix  takes  just  as  long  as  a  bit-reversal  permutation  or  a  (Ig  M)-bit  right  rotation.  The  in-core 
portions  of  an  out-of-core  Swarztrauber  algorithm  would  also  have  to  perform  in-core  bit-reversal 
permutations,  and  so  they  would  be  slower  than  the  in-core  portions  of  our  algorithm.  Consequently, 
we  did  not  implement  an  explicit  out-of-core  version  of  Swarztrauber’s  method. 

Implementation  notes 

This  section  concludes  with  some  notes  on  the  implementation  of  our  out-of-core  FFT  algorithm. 

We  start  with  the  twiddle  factors,  which  were  omitted  in  the  above  description.  The  butterfly 
operations  in  Figure  3  proceed  in  IgiV  levels  from  left  to  right,  just  as  in  Figure  1.  If  we  number 
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these  levels  from  1  to  IgiV,  then  all  twiddle  factors  of  the  /th  level  are  powers  of  W21.  We  obtain 
these  powers  of  Wji  efficiently  by  directly  computing  the  exponent  of  the  twiddle  factor  in  super¬ 
level  5,  mini-butterfly  q  within  the  superlayer  (starting  from  0,  and  the  range  of  q  depends  on 


M  ns  N/lgM\ 


+  jM^.  This 


the  superlayer),  and  the  jth  butterfly  within  a  group  of  butterflies  as 
computation  is  easy  to  move  into  loops  and  avoids  expensive  sine  and  cosine  calls. 

The  ViC*  interface  [CH96]  provides  the  appearance  of  the  PDM  when  performing  parallel  I/O 
operations.  The  interface  is  portable,  and  it  is  implemented  as  a  set  of  wrappers  on  top  of  an 
existing  serial  or  parallel  file  system.  Here,  we  used  an  implementation  on  top  of  a  traditional 
UNIX  file  system  (UFS),  but  with  multiple  disks. 

The  BMMC  permutation  subroutine  is  taken  from  the  implementation  in  [CH96].  It  calls  the 
ViC*  interface  to  perform  striped  reads  and  independent  writes.  It  is  carefully  optimized  for  both 


in-core  computation  and  I/O. 

Finally,  we  implemented  the  FFT  algorithm  with  both  synchronous  (i.e.,  blocking)  and  asyn¬ 
chronous  (non-blocking)  I/O  calls;  the  ViC*  interface  supports  both.  With  asynchronous  I/O,  as 
we  compute  the  butterflies  of  the  qth  memoryload,  we  simultaneously  prefetch  the  data  of  the 
(q  -1-  l)st  memoryload  and  write  behind  the  computed  data  of  the  {q  —  l)st  memoryload.  The 
reduced  latency  does  not  come  for  free,  however,  as  we  must  allocate  prefetch  and  write-behind 
buffers  of  the  same  size  as  the  compute  buffer.  Thus,  the  effective  memory  size,  i.e.,  the  value  of  M 
used  in  the  algorithm,  is  smaller  with  asynchronous  I/O  than  with  synchronous  I/O.  Because  we 
carve  memory  into  three  parts  and  M  must  be  a  power  of  2,  asynchronous  1/ 0  reduces  the  effective 
memory  size  by  a  factor  of  4.  Context  switching  is  an  additional  cost,  as  one  kernel-level  thread 
serves  each  physical  disk  and  is  switched  in  to  handle  I/O  initiation  and  completion.  Nevertheless, 
we  shall  see  in  Section  7  that  asynchronous  I/O  is  usually  worthwhile. 


7  Performance  of  the  out-of-core  FFT  algorithm 

This  section  presents  timing  results  for  the  out-of-core  FFT  algorithm  on  two  different  DEC  Alpha 
platforms.  In  all  cases,  block  sizes  were  2^®  bytes. 

We  start  with  a  direct  comparison  of  our  algorithm  and  the  in-core  methods  running  with 
demand  paging  on  zayante.  With  our  algorithm,  we  used  D  -  \  disk  and  varied  the  memory  size 
on  zayante  from  2^^  to  2^®.  Using  only  one  disk  for  the  our  algorithm  makes  for  a  fair  comparison 
to  demand  paging,  since  there  is  only  one  swap  disk.  Table  2  shows  running  times  with  both 
synchronous  and  asynchronous  I/O.  In  some  cases,  the  asynchronous  time  exceeds  the  synchronous 
time  because,  we  believe,  having  one  processor  running  both  threads  (main  computation  and  disk 
server)  causes  context  switches  during  butterfly  computations  and  BMMC  permutations.  Also,  in 
some  cases  using  more  memory  does  not  help.  Note,  however,  that  at  the  problem  sizes  at  which  the 
in-core  algorithms  encounter  heavy  paging— A  >  2^^  for  Cooley-Tukey  and  Swarztrauber— our  out- 
of-core  algorithm  is  faster  if  it  has  enough  memory  to  work  with.  (At  N  =  2^^,  our  algorithm  with 
32  MB  of  memory  and  asynchronous  I/O  is  over  46  times  faster  than  Cooley-Tukey.)  Considering 
the  overhead  due  to  the  ViC*  wrappers  and  UFS  calls,  it  is  impressive  that  our  algorithm  can  run 
faster  than  even  Swarztrauber’s  method. 

Table  3  shows  running  times  on  a  different  system,  named  adams,  which  is  a  DEC  2100  server 
with  two  175-MHz  Alpha  processors,  320  MB  of  memory,  and  eight  2-GB  disks  for  data  (so  that 
D  =  8).  It  has  the  same  software  environment  as  zayante,  but  with  eight  disks,  its  1/ 0  bandwidth  is 
much  higher.  Compared  to  the  in-core  methods  in  Table  1  even  when  they  run  entirely  in  memory, 
the  normalized  times  (which  do  include  I/O  time)  are  at  worst  slightly  higher  and  in  some  cases 
even  lower!  In  one  case  {N  =  2^^),  the  running  time  on  adams  is  144.7  times  lower  than  Cooley- 
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Problem 

Memory  size  (bytes) 

size 

222 

223 

224 

225 

(points) 

sync 

async 

sync 

async 

sync 

ctsync 

sync 

async 

222 

440.713 

4-77610 

505.569 

5-47896 

414.395 

4-49088 

372.099 

4-03251 

402.335 

4-36019 

403.913 

4-37729 

437.361 

4-73977 

456.378 

4-94586 

223 

1242.05 

6-43756 

1141.72 

5-91755 

846.030 

4-38498 

779.431 

4-03980 

857.528 

4-44458 

856.236 

4-43788 

931.019 

4-82548 

923.335 

4-78566 

224 

2479.13 

6-15699 

2240.50 

5-56434 

1939.60 

4-81705 

2221.19 

5-51639 

1699.27 

4-22018 

1757.30 

4-36430 

1785.00 

4-43310 

1692.25 

4-20275 

225 

4851.86 

5-78387 

4685.00 

5-58496 

4827.49 

5.75482 

4.573.11 

5-45157 

3412.08 

4-06752 

3461.50 

4-12643 

3- 539.99 

4- 22000 

3318.47 

3.95592 

226 

10714.9 

6-14094 

11772.7 

6-74719 

9558.68 

5-47829 

8911.21 

5-10721 

7689.57 

4-40706 

8751.93 

5-01592 

7495.07 

4-29559 

7581.10 

4-34489 

Table  2:  Running  times  for  the  out-of-core  algorithm  on  zayante  with  one  disk,  varying  problem  and  memory 
sizes,  and  both  synchronous  and  asynchronous  I/O.  Times  are  in  seconds,  and  in  italics  are  the  normalized 
times  (the  running  time  divided  by  N\gN)  in  microseconds. 


,  Memory  size  (bytes) 

Problem  size 

226 

227 

(points) 

sync 

async 

sync 

async 

223 

340.659 

1.76564 

293.921 

1.52340 

224 

799.221 

1.98489 

674.864 

1.67604 

835.317 

2.07453 

714.364 

1.77414 

225 

1718.09 

2.04812 

1541.35 

1.83743 

1712.90 

2.04194 

1458.18 

1. 73829 

226 

3.500.04 

2.00595 

.3092.14 

1.77217 

3496.10 

2.00369 

30-54.42 

1.75055 

227 

72.32.63 

1.99583 

6226.17 

1.71810 

7105.92 

1.96086 

6252.80 

1.72544 

00 

14671.7 

1.95201 

12695.0 

1.68902 

14243.6 

1.89506 

12597.9 

1.67610 

229 

30319.8 

1.94741 

26431.9 

1.69770 

30281.2 

1.94494 

26173.6 

1.68111 

Table  3:  Running  times  for  the  out-of-core  algorithm  on  adams  with  8  disks,  varying  problem  and  memory 
sizes,  and  both  synchronous  and  asynchronous  I/O.  Times  are  in  seconds,  and  in  italics  are  the  normalized 
times  (the  running  time  divided  by  iVlgiV)  in  microseconds.  With  2^"  bytes  of  memory,  a  2^^-point  FFT 
fits  in  memory,  so  this  timing  is  omitted. 
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Tukey  on  zayante.  The  operating  system  may  choose  to  run  a  ready  thread  on  either  processor, 
and  so  disk-server  threads  do  not  interfere  with  butterfly  computations  as  much  as  on  zayante. 
Consequently,  on  adams  it  is  always  faster  to  use  asynchronous  I/O  than  to  use  synchronous  I/O. 

8  Conclusion 

We  have  examined  both  analytically  and  experimentally  two  classes  of  methods  for  computing  large 
Fourier  transforms.  In-core  FFT  algorithms  run  slowly  when  they  execute  in  a  demand-paging 
environment.  Of  the  three  that  we  examined,  Swarztrauber’s  method  is  by  far  the  fastest  and  has 
the  best  locality  of  reference.  The  explicit  out-of-core  method  that  we  developed  for  the  PDM  is 
asymptotically  optimal  in  this  model,  and  it  has  good  empirical  performance.  On  a  DEC  2100 
server  with  two  processors,  large  memory,  and  eight  data  disks,  our  algorithm’s  normalized  time  is 
competitive  with  in-core  methods,  even  when  they  run  entirely  in  memory. 

Although  it  uses  both  processors,  our  current  DEC  2100  implementation  is  essentially  a  unipro¬ 
cessor  implementation.  Our  own  breakdowns  of  running  times  on  large  problems  show  that  com¬ 
putation  time  is  a  bottleneck.  We  plan  to  investigate  true  parallel  out-of-core  algorithms,  using 
parallelized  versions  of  the  permutation  methods  described  in  this  paper. 
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